dynamic treatment regime
EpiCare: A Reinforcement Learning Benchmark for Dynamic Treatment Regimes
Healthcare applications pose significant challenges to existing reinforcement learning (RL) methods due to implementation risks, low data availability, short treatment episodes, sparse rewards, partial observations, and heterogeneous treatment effects. Despite significant interest in using RL to generate dynamic treatment regimes for longitudinal patient care scenarios, no standardized benchmark has yet been developed. To fill this need we introduce Episodes of Care (EpiCare), a benchmark designed to mimic the challenges associated with applying RL to longitudinal healthcare settings. We leverage this benchmark to test five state-of-the-art offline RL models as well as five common off-policy evaluation (OPE) techniques. Our results suggest that while offline RL may be capable of improving upon existing standards of care given large data availability, its applicability does not appear to extend to the moderate- to low-data regimes typical of healthcare settings. Additionally, we demonstrate that several OPE techniques that have become standard in the medical RL literature fail to perform adequately on our benchmark. These results suggest that the performance of RL models in dynamic treatment regimes may be difficult to meaningfully evaluate using current OPE methods, indicating that RL for this application may still be in its early stages. We hope that these results, along with the benchmark itself, will facilitate the comparison of existing methods and inspire further research into techniques that increase the practical applicability of medical RL.
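To make the OPE difficulty concrete, the simplest such estimator is (weighted) importance sampling over logged trajectories. Below is a minimal sketch, assuming episodes stored as (state, action, reward, behavior-probability) tuples; the trajectory format and names are illustrative, not EpiCare's actual interface:

```python
import numpy as np

def wis_estimate(trajectories, target_policy, gamma=0.99):
    """Weighted (self-normalized) importance sampling estimate of a
    target policy's value from logged behavior-policy data.

    trajectories: list of episodes, each a list of
        (state, action, reward, behavior_prob) tuples.
    target_policy: callable (state, action) -> action probability
        under the policy being evaluated.
    """
    weights, returns = [], []
    for episode in trajectories:
        rho, ret = 1.0, 0.0
        for t, (s, a, r, b_prob) in enumerate(episode):
            rho *= target_policy(s, a) / b_prob  # cumulative importance ratio
            ret += gamma**t * r                  # discounted return
        weights.append(rho)
        returns.append(ret)
    weights, returns = np.asarray(weights), np.asarray(returns)
    # Self-normalization reduces variance at the cost of bias; with the
    # short, sparse-reward episodes typical of treatment settings, a few
    # trajectories can still dominate the estimate.
    return float(np.sum(weights * returns) / np.sum(weights))
```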
Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that dictate how to assign treatment to patients based on their evolving treatment and covariate history. These regimes are particularly effective for managing chronic disorders and are arguably one of the key steps toward more personalized decision-making. In this paper, we investigate the online reinforcement learning (RL) problem of selecting optimal DTRs when observational data is available. We develop the first adaptive algorithm that achieves near-optimal regret in DTRs in online settings, without any access to historical data. We further derive informative bounds on the system dynamics of the underlying DTR from confounded observational data. Finally, we combine these results and develop a novel RL algorithm that efficiently learns the optimal DTR while leveraging abundant, yet imperfect, confounded observations.
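For intuition about what informative bounds from confounded observations can look like, the classical Manski construction bounds E[Y | do(a)] using only the observational distribution and the outcome's range. The sketch below is this generic construction, not the paper's specific derivation:

```python
import numpy as np

def manski_bounds(actions, outcomes, action, y_min=0.0, y_max=1.0):
    """Bounds on E[Y | do(a)] from confounded data when Y lies in [y_min, y_max].

    With no assumption on the confounding, the counterfactual outcome of
    units that did not receive `action` can lie anywhere in [y_min, y_max]:
        E[Y | A=a] P(A=a) + y_min (1 - P(A=a))
          <= E[Y | do(a)] <=
        E[Y | A=a] P(A=a) + y_max (1 - P(A=a)).
    """
    actions, outcomes = np.asarray(actions), np.asarray(outcomes)
    p_a = np.mean(actions == action)
    mean_obs = outcomes[actions == action].mean()
    lo = mean_obs * p_a + y_min * (1.0 - p_a)
    hi = mean_obs * p_a + y_max * (1.0 - p_a)
    return lo, hi
```

An online learner can clip or initialize its value estimates with such bounds, which is the sense in which imperfect confounded data can narrow the exploration problem.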
On Multiple Robustness of Proximal Dynamic Treatment Regimes
Gao, Yuanshan, Bai, Yang, Cui, Yifan
Dynamic treatment regimes are sequential decision rules that adapt treatment to individual time-varying characteristics and outcomes in order to achieve optimal effects, with applications in precision medicine, personalized recommendations, and dynamic marketing. Estimating optimal dynamic treatment regimes via sequential randomized trials can be costly and raise ethical hurdles, often necessitating the use of historical observational data. In this work, we utilize the proximal causal inference framework to learn optimal dynamic treatment regimes when the unconfoundedness assumption fails. Our contributions are four-fold: (i) we propose three nonparametric identification methods for optimal dynamic treatment regimes; (ii) we establish the semiparametric efficiency bound for the value function of a given regime; (iii) we propose a (K+1)-robust method for learning optimal dynamic treatment regimes, where K is the number of stages; (iv) as a by-product for marginal structural models, we establish identification and estimation of counterfactual means under a static regime. Numerical experiments validate the efficiency and multiple robustness of our proposed methods.
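To illustrate the proximal identification idea in the single-stage case: with a treatment proxy Z and an outcome proxy W, one solves E[Y − h(W, A, X) | Z, A, X] = 0 for an outcome bridge function h, after which E[Y(a)] = E[h(W, a, X)]. The sketch below fits a linear bridge by empirical moment matching; the linear specification and variable names are assumptions for illustration, not the paper's multi-stage (K+1)-robust estimator:

```python
import numpy as np

def fit_outcome_bridge(Y, A, W, Z, X):
    """Fit a linear outcome bridge h(W, a, X) = theta_a . [1, W, X] per
    treatment arm from the moment condition E[Y - h(W, A, X) | Z, A, X] = 0,
    using [1, Z, X] as instruments (just-identified when dim(Z) = dim(W))."""
    theta = {}
    for a in np.unique(A):
        idx = A == a
        Phi = np.column_stack([np.ones(idx.sum()), W[idx], X[idx]])  # regressors
        G = np.column_stack([np.ones(idx.sum()), Z[idx], X[idx]])    # instruments
        # Empirical moments G'(Y - Phi @ theta) = 0  =>  (G'Phi) theta = G'Y
        theta[a], *_ = np.linalg.lstsq(G.T @ Phi, G.T @ Y[idx], rcond=None)
    return theta

def regime_value(theta, regime, W, X):
    """Plug-in value of a regime d: average h(W, d(X), X) over the sample."""
    vals = []
    for w, x in zip(W, X):
        phi = np.concatenate([[1.0], np.atleast_1d(w), np.atleast_1d(x)])
        vals.append(phi @ theta[regime(x)])
    return float(np.mean(vals))
```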
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
- (4 more...)
- Research Report > Experimental Study (0.87)
- Research Report > Strength High (0.65)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)
- North America > United States > Oregon > Benton County > Corvallis (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Counterfactual Survival Q-Learning for Longitudinal Randomized Trials via Buckley-James Boosting
We propose a Buckley-James (BJ) Boost Q-learning framework for estimating optimal dynamic treatment regimes under right-censored survival data, tailored to longitudinal randomized clinical trial settings. The method integrates accelerated failure time models with iterative boosting techniques, including componentwise least squares and regression trees, within a counterfactual Q-learning framework. By directly modeling conditional survival time, BJ Boost Q-learning avoids the restrictive proportional hazards assumption and enables unbiased estimation of stage-specific Q-functions. Grounded in potential outcomes, this framework ensures identifiability of the optimal treatment regime under standard causal assumptions. Compared to Cox-based Q-learning, which relies on hazard modeling and may suffer from bias under misspecification, our approach provides robust and flexible estimation. Simulation studies and an analysis of the ACTG175 HIV trial demonstrate that BJ Boost Q-learning yields higher accuracy in treatment decision-making, especially in multistage settings where bias can accumulate.
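The core of the method is the Buckley-James step: each censored outcome is replaced by its conditional expectation under the current fit before the regressor is refit. A minimal sketch with a linear base learner follows (the paper uses boosting learners; the Kaplan-Meier tail renormalization here is a simplification for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def buckley_james_impute(X, y, delta, n_iter=20, tol=1e-6):
    """Buckley-James loop: impute right-censored (log-)survival times with
    their conditional expectation under a Kaplan-Meier estimate of the
    residual distribution, then refit the regressor.

    X: covariate matrix; y: observed (log-)times;
    delta: 1 = event observed, 0 = right-censored.
    """
    model = LinearRegression().fit(X[delta == 1], y[delta == 1])
    y_star = y.astype(float).copy()
    for _ in range(n_iter):
        pred = model.predict(X)
        resid = y - pred
        order = np.argsort(resid)
        e, d = resid[order], delta[order]
        n = len(e)
        surv = np.cumprod(1.0 - d / np.arange(n, 0, -1))   # KM survival of residuals
        jumps = np.concatenate([[1.0], surv[:-1]]) - surv  # CDF mass at events
        for i in np.where(delta == 0)[0]:
            tail = e > resid[i]
            mass = jumps[tail].sum()
            if mass > 0:  # renormalize the KM tail (simplification)
                y_star[i] = pred[i] + (e[tail] * jumps[tail]).sum() / mass
        refit = LinearRegression().fit(X, y_star)
        if np.allclose(refit.coef_, model.coef_, atol=tol):
            return refit, y_star
        model = refit
    return model, y_star
```

In the Q-learning setting, such imputed outcomes would feed the stage-specific Q-function regressions fitted backward from the final stage.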
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > Minnesota (0.04)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology > HIV (0.34)
SAFER: A Calibrated Risk-Aware Multimodal Recommendation Model for Dynamic Treatment Regimes
Shen, Yishan, Ye, Yuyang, Xiong, Hui, Chen, Yong
Dynamic treatment regimes (DTRs) are critical to precision medicine, optimizing long-term outcomes through personalized, real-time decision-making in evolving clinical contexts, but they require careful supervision to guard against unsafe treatment risks. Existing efforts rely primarily on clinician-prescribed gold standards despite the absence of a known optimal strategy, and they predominantly use structured EHR data without extracting valuable insights from clinical notes, limiting their reliability for treatment recommendations. In this work, we introduce SAFER, a calibrated risk-aware tabular-language recommendation framework for DTRs that integrates structured EHR data and clinical notes, enabling each modality to learn from the other, and addresses inherent label uncertainty by treating the optimal treatment as ambiguous for deceased patients. Moreover, SAFER employs conformal prediction to provide statistical guarantees, ensuring safe treatment recommendations while filtering out uncertain predictions. Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. These findings underscore SAFER's potential as a trustworthy and theoretically grounded solution for high-stakes DTR applications.
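The statistical guarantee comes from conformal prediction, whose mechanics are easiest to see in the split-conformal case. A minimal sketch, assuming a fitted classifier exposing predict_proba and using 1 − p(true label) as the nonconformity score (the model and abstention semantics here are illustrative, not SAFER's actual calibration):

```python
import numpy as np

def conformal_calibrate(model, X_cal, y_cal, alpha=0.1):
    """Split conformal calibration: find the nonconformity-score threshold
    on held-out data so that prediction sets cover the true label with
    probability at least 1 - alpha. y_cal: integer treatment labels."""
    probs = model.predict_proba(X_cal)
    scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]  # 1 - p(true label)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def recommend_or_abstain(model, X, q_hat):
    """Form conformal prediction sets over treatments; recommend only when
    the set is a singleton, otherwise abstain (defer to the clinician)."""
    probs = model.predict_proba(X)
    pred_sets = [np.where(1.0 - p <= q_hat)[0] for p in probs]
    return [int(s[0]) if len(s) == 1 else None for s in pred_sets]
```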
- North America > United States > Pennsylvania (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Censoring-Aware Tree-Based Reinforcement Learning for Estimating Dynamic Treatment Regimes with Censored Outcomes
Paul, Animesh Kumar, Greiner, Russell
Dynamic Treatment Regimes (DTRs) provide a systematic approach for making sequential treatment decisions that adapt to individual patient characteristics, particularly in clinical contexts where survival outcomes are of interest. We propose Censoring-Aware Tree-Based Reinforcement Learning (CA-TRL), a novel framework that addresses the complexities associated with censored data when estimating optimal DTRs from observational data. By enhancing traditional tree-based reinforcement learning methods with augmented inverse probability weighting (AIPW) and censoring-aware modifications, CA-TRL delivers robust and interpretable treatment strategies. We demonstrate its effectiveness through extensive simulations and a real-world application to the SANAD epilepsy dataset, where it outperformed the recently proposed ASCL method on key metrics such as restricted mean survival time (RMST) and decision-making accuracy. This work represents a step forward in advancing personalized and data-driven treatment strategies across diverse healthcare settings.
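The AIPW component referenced above is a standard doubly robust construction. A minimal sketch for a single stage with binary treatment, assuming fitted outcome models and a propensity model (in a randomized stage the propensity is known); CA-TRL uses such pseudo-outcomes inside tree-based policy search, and this sketch shows only the estimator itself:

```python
import numpy as np

def aipw_pseudo_outcomes(y, a, e_hat, mu1_hat, mu0_hat):
    """Doubly robust (AIPW) pseudo-outcomes for a binary treatment contrast.

    y: observed outcomes; a: treatment indicators in {0, 1};
    e_hat: estimated propensity P(A=1 | X); mu1_hat, mu0_hat: fitted
    outcome regressions E[Y | X, A=1] and E[Y | X, A=0].
    Consistent if either the propensity or the outcome models are correct.
    """
    term1 = mu1_hat + a * (y - mu1_hat) / e_hat
    term0 = mu0_hat + (1 - a) * (y - mu0_hat) / (1.0 - e_hat)
    return term1 - term0

# Example usage: the mean of the pseudo-outcomes estimates the average
# treatment effect, and their sample variance gives a standard error.
# phi = aipw_pseudo_outcomes(y, a, e_hat, mu1, mu0)
# ate, se = phi.mean(), phi.std(ddof=1) / np.sqrt(len(phi))
```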
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.93)
Reviews: Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
Update after rebuttal: Due to author comments and, in particular, discussions with the other reviewers, I have updated my score from 4 to a weak accept (6). For the future draft, aside from the revisions and clarifications the authors have promised in the rebuttal, I recommend the following (slight) modifications to improve the manuscript: The motivation in the introduction would be strengthened by drawing clearer connections to the real world. The authors should consider picking a specific real-world example and illustrating the method through that example (even if it is not possible to provide simulation results on such an example). In line with this, the authors should be careful in their discussion of safe RL. Typically such methods use constraints to ensure safety, but the authors do not appear to explicitly use or discuss such methods here.
Reviews: Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
In this paper, the authors provide a method for incorporating observational data (possibly subject to unobserved confounding) to improve the performance of policy learning in online settings (the crucial results are Theorems 5, 7, and 8). After a period of discussion, the reviewers came to a consensus that this paper merits publication at NeurIPS and will contribute to the RL literature by giving a principled method of incorporating observational data, even when it is confounded.